Variable importance

Neural Networks

Carolina Musso
Rafael Rocha
Stefan Zurman
Vitor Borges

What to expect

  • Introduction
    • Interpretability of models
  • Methods for model interpretation
  • Examples:
    • SHAP
    • LIME
    • Causal Inference
  • Conclusions

Introduction

Introduction: Variable importance

  • Interpretability in a Data Driven World

  • Sometimes you do not care why a decision was made. In other cases, knowing the ‘why’ is important.

  • Human desire to find meaning in the world.

Introduction

  • Trade-off: predictive performance vs. opaque models.

  • ML models pick up biases from the training data.

  • ML can be debugged and audited if interpretable.

Introduction: Interpretability

  • Verify if the accuracy results come from artifacts: validation.

  • Important in health and social sciences: accountability.

  • Exploration and analysis in the sciences: extract insights from complex systems.

  • The degree to which a human can understand the cause of a decision or can consistently predict the model’s result.

Introduction: The classic approach

  • Models inherently interpretable: linear regression, logistic regression, decision tree…

    • Coefficients: rates, odds ratio…
  • Lower predictive performance in comparison to other machine learning models.

  • However, insights are hidden in increasingly complex models.

Introduction: Interpreting DNNs

Methods for model interpretation

Classic approach

Subset selection:

  • Best subset, forward, and backward stepwise selection.
  • Mallows’ Cp, adjusted R², AIC, BIC.

Classic approach

Shrinkage:

  • Ridge, Lasso

Classic approach

Dimension Reduction:

  • PCA…

Other techniques

  • Model Agnostic:
    • Global: Partial Dependence Plots, Accumulated Local Effects, Feature Interaction…
    • Local: LIME, SHAP
  • DNN: Learned features, Saliency Maps, Sensitivity Analysis, Taylor Decomposition, Layer-wise Relevance Propagation

Interpretability Methods layer

Model-Agnostic Methods: PFI

Permutation feature importance

  • Permute the feature’s values: this breaks the relationship between the feature and the outcome.

  • A feature is “important” if shuffling its values increases the model error, because in this case the model relied on the feature for the prediction.
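As a sketch, permutation importance takes only a few lines (the toy model and data below are assumptions for illustration; any fitted model with a `predict` method works):

```python
import numpy as np

rng = np.random.default_rng(0)

class ToyModel:
    # toy "fitted model": depends strongly on column 0, not at all on column 1
    def predict(self, X):
        return 3.0 * X[:, 0]

def permutation_importance(model, X, y, n_repeats=10):
    base_error = np.mean((y - model.predict(X)) ** 2)
    importances = []
    for j in range(X.shape[1]):
        errs = []
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])  # break feature-target link
            errs.append(np.mean((y - model.predict(Xp)) ** 2))
        importances.append(np.mean(errs) - base_error)  # increase in error
    return np.array(importances)

X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0]
imp = permutation_importance(ToyModel(), X, y)
# imp[0] is large (shuffling column 0 destroys the fit); imp[1] is zero
```

Because the method only needs predictions, it is fully model-agnostic.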

Neural Networks Methods

  • NNs learn features in hidden layers: special tools are needed to uncover them.
  • Gradients can be exploited, making these methods more computationally efficient than model-agnostic methods that look at the model “from the outside”.
  • Learned features, Saliency Maps, Layer-wise relevance propagation (LRP), Influential Instances

Learned Features:

  • What features has the neural network learned?

  • Activation maximization (AM): finding the input that maximizes the activation of that unit.
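A minimal sketch of the idea, assuming a single tanh unit with fixed weights and a norm constraint on the input: gradient ascent recovers the maximizing input, which for this toy unit is simply the normalized weight vector.

```python
import numpy as np

w = np.array([1.0, -2.0, 0.5])             # fixed weights of the hidden unit

def activation(x):
    return np.tanh(w @ x)

x = np.full(3, 0.1)                        # small starting input
lr = 0.5
for _ in range(500):
    grad = (1 - activation(x) ** 2) * w    # d tanh(w.x) / dx
    x = x + lr * grad                      # gradient ascent step
    x = x / np.linalg.norm(x)              # project back to the unit sphere
# x converges to w / ||w||, the unit-norm input with maximal activation
```

Real AM applies the same loop to a pixel image and a deep unit, usually with extra regularizers to keep the image natural-looking.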

Learned features

  • Network dissection

Pixel Attribution:

  • Sensitivity Analysis, Taylor Decomposition, Saliency Maps
  • How did each pixel contribute to a particular prediction?
  • Perturbation-based: SHAP, LIME
  • Gradient-based: Vanilla Gradient, DeconvNet, Grad-CAM
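A vanilla-gradient sketch for a tiny fully connected “image” model (the weights and the 6-pixel input are assumptions): the saliency of a pixel is the absolute gradient of the class score with respect to that pixel.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 6))               # hidden-layer weights
W2 = rng.normal(size=4)                    # output weights

def score(x):                              # class score for a 6-"pixel" input
    return W2 @ np.tanh(W1 @ x)

def saliency(x):
    h = np.tanh(W1 @ x)
    return np.abs(W1.T @ ((1 - h ** 2) * W2))   # |d score / d x_i| per pixel

x = rng.normal(size=6)
s = saliency(x)                            # one importance value per pixel
```

For a real CNN the backward pass is done by the framework; the saliency map is the gradient image reshaped to the input dimensions.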

Pixel Attribution (Saliency Maps):

Layer-wise relevance propagation

  • LRP: Backward propagation technique

Package NeuralNet Tools

pacman::p_load(NeuralNetTools, tidyverse,
               nycflights13, nnet)

# December United flights: scale the predictors and
# rescale the response to [0, 1]
tomod <- flights %>%
  filter(month == 12 & carrier == "UA") %>%
  select(arr_delay, dep_delay, dep_time,
         arr_time, air_time, distance) %>%
  mutate(across(-arr_delay, ~ as.numeric(scale(.x)))) %>%
  mutate(arr_delay = scales::rescale(arr_delay, to = c(0, 1))) %>%
  data.frame()

# single hidden layer with 5 units, linear output
mod <- nnet(arr_delay ~ ., size = 5,
            linout = TRUE, data = tomod,
            trace = FALSE)

plotnet(mod)

SHAP Application

Cooperative Game Theory

  • The Shapley value is a solution concept in cooperative game theory.
  • It was named in honor of Lloyd Shapley, who introduced it in 1951 and won the Nobel Memorial Prize in Economic Sciences for it in 2012.

Lloyd Shapley

Cooperative Game Theory

  • Players cooperate in a coalition and receive a certain profit from this cooperation.
  • Some players may contribute more to the coalition than others.
  • How important is each player to the overall cooperation, and what payoff can he or she reasonably expect?
  • Shapley value is a method for assigning payoffs to players depending on their contribution in the total payoff.

Cooperative Game Theory

  • It is contextualized in a cooperative game with \(N\) agents in a coalition. Each agent has only two choices, cooperate or not cooperate. Therefore the number of possible coalitions is \(2^{N}\).

  • The coalition is a subset of the set of agents \(N\) and is represented by \(S\). The set of all possible coalitions is represented by \(\mathcal{P}(N) \Rightarrow S \in \mathcal{P}(N)\).

  • The function \(v: \mathcal{P}(N) \rightarrow \mathbb{R}\) assigns to each coalition \(S\) a value that corresponds to the sum of the expected payoffs that the members of the coalition can obtain.

Cooperative Game Theory

  • The function \(\varphi_{i}(v)\) returns a ‘fair’ proportion of distributing the coalition payoff according to the individual contribution of each agent. This function is defined as follows:

\[ \varphi_{i}(v) = \frac{1}{n} \sum_{S \subseteq N \setminus \{i\}} \binom{n-1}{|S|}^{-1} \left( v(S \cup \{ i \}) - v(S) \right) \]

Cooperative Game Theory

  • One way to interpret what is being explained in the formula is that the Shapley value of an agent is the average marginal contribution of the agent to all possible coalitions:

\[ \varphi_{i}(v) = \frac{1}{\text{number of agents}} \sum_{\substack{\text{coalitions } S \\ \text{that exclude } i}} \frac{\text{marginal contribution of } i \text{ to } S}{\text{number of coalitions excluding } i \text{ of size } |S|} \]
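The formula can be computed exactly by enumerating all coalitions, which is feasible only for small \(N\). The payoff function below is the classic “glove game” (players 0 and 1 hold left gloves, player 2 the only right glove; a complete pair is worth 1), chosen here as an illustrative assumption:

```python
from itertools import combinations
from math import comb

def shapley_values(n, v):
    """Exact Shapley values for an n-player game with payoff function v."""
    phi = [0.0] * n
    for i in range(n):
        others = [p for p in range(n) if p != i]
        for size in range(n):
            for S in combinations(others, size):
                weight = 1 / (n * comb(n - 1, size))
                phi[i] += weight * (v(set(S) | {i}) - v(set(S)))
    return phi

def glove_game(S):
    # value = number of complete left-right glove pairs in coalition S
    return min(len(S & {0, 1}), len(S & {2}))

phi = shapley_values(3, glove_game)
# phi == [1/6, 1/6, 2/3]: the scarce right glove earns the largest share
```

Note the efficiency property: the values sum to the worth of the grand coalition.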

Cooperative Game Theory: Example 1

  • Explaining a linear regression model

  • Linear regression model on the California housing dataset.

  • 20,640 blocks of houses across California in 1990, where our goal is to predict the natural log of the median home price from 8 different features:

Cooperative Game Theory: Example 1

  • MedInc - median income in block group
  • HouseAge - median house age in block group
  • AveRooms - average number of rooms per household
  • AveBedrms - average number of bedrooms per household
  • Population - block group population
  • AveOccup - average number of household members
  • Latitude - block group latitude
  • Longitude - block group longitude

Cooperative Game Theory: Example 1

  • MedInc = 0.45769
  • HouseAge = 0.01153
  • AveRooms = -0.12529
  • AveBedrms = 1.04053
  • Population = 5e-05
  • AveOccup = -0.29795
  • Latitude = -0.41204
  • Longitude = -0.40125
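For a linear model with (assumed) independent features, Shapley values have a closed form, \(\varphi_j = \beta_j (x_j - \mathbb{E}[x_j])\). A sketch with toy numbers (not the California-housing fit above):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))             # background data
beta = np.array([0.5, -1.2, 0.0])          # toy coefficients (assumed)
b0 = 2.0                                   # toy intercept

def shap_linear(x, X, beta):
    # phi_j = beta_j * (x_j - E[x_j]) for independent features
    return beta * (x - X.mean(axis=0))

x = np.array([1.0, 0.5, -2.0])             # instance to explain
phi = shap_linear(x, X, beta)
# efficiency: the phis sum to f(x) - E[f(X)]
```

A feature with a zero coefficient gets a zero Shapley value regardless of its observed value.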

Cooperative Game Theory: Example 1

Cooperative Game Theory: Example 1

Cooperative Game Theory: Example 1

Cooperative Game Theory: Example 1

Cooperative Game Theory: Example 2

  • ImageNet 1000 spans 1000 object classes and contains 1,281,167 training images, 50,000 validation images and 100,000 test images.
  • Use ResNet50 to learn and predict these classes, then explain the results using SHAP.

Cooperative Game Theory: Example 2

Cooperative Game Theory: Example 2

Applications: Machine Learning

  • Replicating Tang et al. (2020) in evaluating how much predictability a single data point contributes to the power of a deep learning model.
  • The article explores SHAP as a method for quantifying the contribution of each data point in the training set of a convolutional neural network.
  • The neural network trained in the article uses X-ray images of the lung to diagnose pneumonia.
  • One feature of the X-ray dataset is the presence of misclassified images.

Applications: Machine Learning

  • The Shapley values of the pixels in each image were summed, yielding a single value per image.
  • The experiment iteratively removes the x% best data and retrains the network to measure the effect on the evaluation metrics.
  • Tang et al. (2020) obtain a causal relationship between SHAP values and model accuracy, reaching an efficiency of 70% at best.

Applications: Machine Learning

Applications: Machine Learning

  • Our application: Images of a section of the Candlesticks chart, from different time scales, were used for S&P500 assets.
  • The purpose of the network is to classify these images according to the price variation 5 periods after the observed pattern, the classes are ‘buy’ and ‘sell’.
  • The image pixels will collaborate with the classification of the image, and the Shapley values will be used to measure the importance of each pixel in the classification.

Applications: Machine Learning

Disclaimer: Don’t invest using candlesticks charts!!!

Applications: Machine Learning

  • The training set of the main model contains 700 images, and the test set 300 images.
  • The image corresponds to the price variation per share of ‘American Airlines’ between the dates of 2008-12-05 and 2009-03-13, each candle represents one of the weeks in the period.

Applications: Machine Learning

  • Using Python’s shap library, Shapley values were calculated for each pixel of every image in the training set.

Applications: Machine Learning

  • Results of removing the x% worst data from the training set.

Applications: Machine Learning

Applications: Machine Learning

  • This is because the low-Shapley medical data, which probably had labeling problems, were removed from the training set, so the network could learn only the true features of the well-classified images.
  • A characteristic of the data used in our empirical exercise, by contrast, is the absence of classification errors.

LIME Application

LIME

Local

  • Instance-based

Interpretable

  • Understandable to humans

Model-Agnostic

  • Works for any model

Explanations

  • Explains the model’s output

Motivation

  • Explain what led the ML model to give a certain prediction to an instance

  • What variables affected the decision?

  • Observe the behavior of the ML model with points around the instance of interest

  • Simulate points in the neighborhood

  • Create an interpretable surrogate model to explain the behaviour of the ML model around the instance

    • Linear Regression, Logistic Regression, GLM, GAM, Decision Tree, etc.

Motivation

Step 1

Select an instance of interest

Step 2

Generate new instances around the original instance and calculate their result with the ML model

Step 3

Apply a weight to the new instances depending on the distance to the original instance

Step 4

Obtain a new interpretable model with the weighted instances

Step 5

Interpret the local model

Obtaining the interpretable model

The local surrogate is the interpretable model \(g\) that solves

\[ \xi(x) = \underset{g \in G}{\arg\min} \; L(f, g, \pi_x) + \Omega(g) \]

\(L(f,g,\pi_x)\) is a loss function, usually MSE, between the predictions of the ML model \(f\) and the predictions of the simplified model \(g\)

\(\pi_x(z)\) is a function that weights the new instances according to their distance to the original instance \(x\)

\(\Omega(g)\) penalizes high-complexity models
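The five steps can be sketched for tabular data. Everything below is an illustrative assumption: the black-box function, the Gaussian sampling, the exponential kernel, and a plain weighted least-squares surrogate with no complexity penalty.

```python
import numpy as np

rng = np.random.default_rng(0)

def black_box(X):                          # stand-in for the opaque ML model
    return np.sin(X[:, 0]) + X[:, 1] ** 2

def lime_explain(f, x, n_samples=5000, width=0.75):
    Z = x + rng.normal(scale=0.3, size=(n_samples, x.size))  # step 2: sample
    y = f(Z)
    d2 = ((Z - x) ** 2).sum(axis=1)
    w = np.exp(-d2 / width ** 2)                             # step 3: weights
    sw = np.sqrt(w)                                          # weighted-LS trick
    A = np.column_stack([np.ones(n_samples), Z]) * sw[:, None]
    coef, *_ = np.linalg.lstsq(A, y * sw, rcond=None)        # step 4: fit g
    return coef[1:]                                          # step 5: local slopes

x0 = np.array([0.0, 1.0])
slopes = lime_explain(black_box, x0)
# slopes approximate the local gradient: ~1 for x[0], ~2 for x[1]
```

The surrogate is only valid near \(x\): the same black box would get very different slopes at another point.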

Applications

Can be used for classification or regression models

  1. Tabular data

    1. Values usually sampled from a normal distribution for each variable
  2. Text or Image data

    1. Words or super-pixels turned on or off according to a Bernoulli distribution

Applications in R



We shall analyse the “biopsy” dataset, which is part of the MASS package.

Applications in R

## load packages
library(MASS)    # biopsy data
library(caret)
library(lime)

## drop the ID column and rows with missing values
biopsy <- na.omit(biopsy)[, -1]

## 75% of the sample size
smp_size <- floor(0.75 * nrow(biopsy))

## set the seed
set.seed(123)
train_ind <- sample(seq_len(nrow(biopsy)), size = smp_size)

train_biopsy <- biopsy[train_ind, ]
test_biopsy  <- biopsy[-train_ind, ]

## random forest with repeated 10-fold cross-validation
model_rf <- caret::train(class ~ ., data = train_biopsy, method = "rf",
                         trControl = trainControl(method = "repeatedcv",
                                                  number = 10, repeats = 5,
                                                  verboseIter = FALSE))

Applications in R

model_rf
Random Forest 

512 samples
  9 predictor
  2 classes: 'benign', 'malignant' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times) 
Summary of sample sizes: 461, 460, 461, 460, 461, 461, ... 
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa    
  2     0.9765460  0.9490151
  5     0.9706712  0.9362207
  9     0.9659729  0.9259912

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.

Applications in R

biopsy_rf_pred <- predict(model_rf, test_biopsy)
explainer <- lime(train_biopsy, model_rf)
explanation1 <- explain(test_biopsy[93, ], explainer, n_labels = 1, n_features = 9)
plot_features(explanation1)

Limitations

  • Difficulties defining the neighborhood

  • Difficulties with non-linearity

  • Incorrect sampling of new instances can produce improbable instances

  • Explanations for nearby points may vary greatly

  • Easily manipulable to hide biases

Causal Inference and Deep Learning

Motivation

  • Given a patient who returned from Africa with a fever and body aches, the AI’s most likely explanation was malaria
  • But the AI got mired in probabilistic associations
  • The key, [Judea Pearl] argues, is to replace reasoning by association with causal reasoning

Intelligent Machines and Causal Reasoning

  • Instead of the mere ability to correlate fever and malaria, machines need the capacity to reason that malaria causes fever.
  • Then, it becomes possible for machines to ask counterfactual questions
    • How would the causal relationships change under some kind of intervention?
  • Pearl views it as the cornerstone of scientific thought.

Causal Inference

  • Mathematics has not developed the asymmetric language required to capture our understanding that if \(X\) causes \(Y\), that does not mean that \(Y\) causes \(X\)
  • Pearl proposes a formal language in which to make this kind of thinking possible
  • The main change is the possibility to evaluate \(P(Y = y|do(X = x))\)
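The difference between conditioning and intervening can be shown on a toy discrete model (all probabilities below are assumed numbers): a confounder \(Z\) drives both \(X\) and \(Y\), so \(P(Y \mid X)\) and \(P(Y \mid do(X))\) disagree.

```python
# P(Z), P(X=1 | Z), P(Y=1 | X, Z) -- all numbers assumed for illustration
p_z = {0: 0.5, 1: 0.5}
p_x1_given_z = {0: 0.2, 1: 0.8}
p_y1_given_xz = {(1, 0): 0.3, (1, 1): 0.7,
                 (0, 0): 0.1, (0, 1): 0.5}

# observational: P(Y=1 | X=1) re-weights Z toward values that make X=1 likely
p_x1 = sum(p_x1_given_z[z] * p_z[z] for z in p_z)
p_obs = sum(p_y1_given_xz[(1, z)] * p_x1_given_z[z] * p_z[z] for z in p_z) / p_x1

# interventional (backdoor adjustment): P(Y=1 | do(X=1)) keeps P(Z) as-is
p_do = sum(p_y1_given_xz[(1, z)] * p_z[z] for z in p_z)
# p_obs = 0.62, p_do = 0.50: association overstates the causal effect here
```

The gap between the two numbers is exactly what the do-operator is designed to express.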

Pearl’s Criticism on Deep Learning

  • He did not expect that so many problems could be solved by pure curve fitting.
  • But what next?
    • “Can you have a robot scientist that would plan an experiment and find new answers to pending scientific questions? That’s the next step.”

Ladder of Causation

Ladder of Causation

Ladder of Causation

CI and DNN Applications

Summary of examples

  • Counterfactual explanations
  • Explaining DNN (CNN) by causal interventions
  • Deep Causal Learning
  • CI and DNN: Supply disruptions

Counterfactual

  • The “event” is the predicted outcome of an instance
  • The “causes” are the particular feature values of this instance that were input to the model and “caused” a certain prediction.

Counterfactual example

A counterfactual explanation of a prediction describes the smallest change to the feature values that changes the prediction to a predefined output.

How do we get a predicted “good” credit risk with probability larger than 50% (versus the current 24.2%)?
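A sketch of that search on a toy logistic “credit model” (all coefficients are assumptions, not the model behind the 24.2% figure): increase one feature in small steps until the predicted probability crosses 50%.

```python
import math

def p_good(income, debt):
    """Hypothetical logistic credit-scoring model (coefficients assumed)."""
    z = -2.0 + 0.8 * income - 1.0 * debt
    return 1 / (1 + math.exp(-z))

x = {"income": 1.0, "debt": 0.5}          # p_good(x) is about 0.15 here
delta = 0.0
while p_good(x["income"] + delta, x["debt"]) <= 0.5 and delta < 10:
    delta += 0.01                          # smallest income increase that flips
# counterfactual: "had income been higher by delta, credit would be 'good'"
```

Practical methods search over all features at once and penalize large or implausible changes, but the principle is the same.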

Counterfactual example

Explaining DNN (CNN) by causal interventions

  • The ability to perform arbitrary causal interventions allows […] to seamlessly capture the chain of causal effects from the input, to the filters, to the DNN outputs
  • “What is the impact of the n-th filter on the m-th layer on the model’s predictions?”
  • Using Structural Causal Models, the model’s post-compression performance could be predicted without the need for retraining.

Explaining DNN (CNN) by causal interventions

Explaining DNN (CNN) by causal interventions

Deep causal learning

The three core strengths of deep learning for causal learning are:

  1. Causal explanations beyond RCTs;
  2. Strong representational and learning capabilities; and
  3. The ability to approximate data generation mechanisms.

Deep causal learning

Deep causal learning

CI and DNN in Supply disruptions

  1. The requirements and processes which are essential for a causality-related data mining phase;
  2. A suitable selection of features to enable a best possible forecast of supply disruptions; and
  3. Effective combination of the DNN and causal inference theory by evaluating propensity scores.

Usefulness of CI and DNN in Supply disruptions

  • Shedding light into black-box DNN models by calculating propensity scores and estimating treatment effects
  • Identifying causal relationships between supplier characteristics and the expected delivery performance
  • Discovering lacking supplier relationships and consequently bundling supply chain risk management-related activities

Examples of conclusions

  • Country: comparing suppliers based on their origin, the strong performance of foreign suppliers is apparent. Overall, deliveries by domestic suppliers are affected by an average reliability decrease of 39.7%.
  • Product Group: raw material suppliers show a superior delivery performance (increase of 55.2%) in comparison to vendors offering operating supplies, measurement equipment, and spare parts.

Conclusions

This is the tip of the iceberg!

Thank you!!!